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Abstract. We present an implementation of model-based online rein- 
forcement learning (RL) for continuous domains with deterministic tran- 
sitions that is specifically designed to achieve low sample complexity. To 
achieve low sample complexity, since the environment is unknown, an 
agent must intelligently balance exploration and exploitation, and must 
be able to rapidly generalize from observations. While in the past a num- 
ber of related sample efficient RL algorithms have been proposed, to al- 
low theoretical analysis, mainly model-learners with weak generalization 
capabilities were considered. Here, we separate function approximation 
in the model learner (which does require samples) from the interpolation 
in the planner (which does not require samples). For model-learning we 
apply Gaussian processes regression (GP) which is able to automatically 
adjust itself to the complexity of the problem (via Bayesian hyperpa- 
rameter selection) and, in practice, often able to learn a highly accu- 
rate model from very little data. In addition, a GP provides a natural 
way to determine the uncertainty of its predictions, which allows us to 
implement the "optimism in the face of uncertainty" principle used to 
efficiently control exploration. Our method is evaluated on four common 
benchmark domains. 



1 Introduction 

In reinforcement learning (RL), an agent interacts with an environment and 
attempts to choose its actions such that an externally defined performance mea- 
sure, the accumulated per-step reward, is maximized over time. One defining 
characteristic of RL is that the environment is unknown and that the agent has 
to learn how to act directly from experience. In practical applications, e.g., in 
robotics, obtaining this experience means having a physical system interact with 
the physical environment in real time. Therefore, RL methods that are able to 
learn quickly and minimize the amount of time the robot needs to interact with 
the environment until good or optimal behavior is learned, are highly desirable. 

In this paper we are interested in online RL for tasks with continuous state 
spaces and smooth transition dynamics that are typical for robotic control do- 
mains. Our primary goal is to have an algorithm which keeps sample complexity 
as low as possible. 



1.1 Overview of the contribution 



To maximize sample efficiency, we consider online RL that is model-based in the 
spirit of RM AX [3] , but extended to continuous state spaces similar to |1I10I5| . 
As in RMAX and related methods, our algorithm, GP-RMAX, consists of two 
parts: a model- learner and a planner. The model-learner estimates the dynam- 
ics of the environment from the sample transitions the agent experiences while 
interacting with the environment. The planner is used to find the best possible 
action, given the current model. As the predictions of the model-learner become 
increasingly more accurate, the actions derived become increasingly closer to 
optimal. To control the amount of exploration, the "optimism in the face of un- 
certainty" principle is employed which makes the agent visit unexplored states 
first. In our algorithm, the model- learner is implemented by Gaussian process 
(GP) regression; being non-parametric, GPs give us enhanced modeling flexibil- 
ity. GPs allow Bayesian model selection and automatic relevance determination. 
In addition, GPs provide a natural way to determine the uncertainty of predic- 
tions, which allows us to implement the "optimism in the face of uncertainty" 
exploration of RMAX in a principled way. The planner uses the estimated tran- 
sition function (as estimated by the model) to solve the Bellman equation via 
value iteration on a uniform grid0 

The key point of our algorithm is that we separate the steps estimating a 
function from samples in the model-learner from solving the Bellman equation 
in the planner. The rationale behind this is that, if the transition function is 
relatively simple, it can be estimated accurately from only few sample transi- 
tions. On the other hand, the optimal value function, due to the inclusion of 
the max operator, often is a complex function with sharp discontinuities. Solv- 
ing the Bellman equation, however, does not require actual "samples" ; instead, 
we must only be able to evaluate the Bellman operator in arbitrary points of 
the state space. This way, when the transition function can be learned from 
only a few samples, large gains in sample efficiency are possible. Competing 
model-free methods, such as fitted Q-iteration |18I8I15) or policy iteration based 
LSPI/LSTD/LSPE [12141131111 . do not have this advantage, as they need the 
actual sample transitions to estimate and represent the value function. 

Conceptually, our approach is closely related to Fitted R-MAX, which was 
proposed in [TU] and uses an instance-based approach in the mo del- learner, and 
related work in [511] . which uses grid-based interpolation in the model-learner. 
The primary contribution of this paper is to use GPs instead. Doing this means 
we are willing to trade off theoretical analysis with practical performance. For 
example, unlike the recent ARL [T], for which PAC-stylc performance bounds 
could be derived (because of its grid-based implementation of model- learning) , 
a GP is much better able to handle generalization and as a consequence can 
achieve much lower sample complexity. 



While certainly more advanced methods exist, e.g., [9 14] , for our purpose here, a 
uniform grid is sufficient as proof of concept. 



1.2 Assumptions and limitations 

Our approach makes the following assumptions (most of which are also made in 
related work, even if it is not always explicitly stated) : 

- Low dimensionality of the state space. With a uniform grid, the number 
of grid points for solving the Bellman equation scales exponentially with 
the dimensionality. While more advanced methods, such as sparse grids or 
adaptive grids, may allow us to somewhat reduce this exponential increase, at 
the end they do not break the curse of dimensionality. Alternatively, one can 
use nonlinear function approximation; however, despite some encouraging 
results, it is unclear as to whether this approach would really do any better 
in general applications. Today, breaking the curse of dimensionality is still 
an open research problem. 

- Discrete actions. While continuous actions may be discretized, in practice, 
for higher dimensional action spaces this becomes infeasible. 

- Smooth transition junction. Performing an action from states that are "close" 
must lead to successor states that are "close". (Otherwise both the gener- 
alization in the model learner and the interpolation in the value function 
approximation would not work). 

- Deterministic transitions. This is not a fundamental requirement of our ap- 
proach, since GPs can also learn noisy functions (either due to observation 
noise or random disturbances with small magnitude), and the Bellman op- 
erator can be evaluated in the resulting predictive distribution. Rather it is 
one taken for convenience. 

- Known reward function. Assuming that the reward function is known and 
only the transition function needs to be learned is what is different from most 
comparable work. While it is not a fundamental requirement of our approach 
(since we could learn the reward function as well), it is an assumption that 
we think is well justified: for one, reward is the performance criterion and 
specifies the goal. For the type of control problems we consider here, reward 
is always externally defined and never something that is "generated" from 
within the environment. Two, reward sometimes is a discontinuous function, 
e.g., +1 at the goal state and elsewhere. Which makes it not very amenable 
for function approximation. 

2 Background: Planning when the model is exact 

Consider the reinforcement learning problem for MDPs with continuous state 
space, finite action space, discounted reward criterion and deterministic dynam- 
ics [19] . In this section we assume that dynamics and rewards are available to the 
learning agent. Let state space X be a hyperrectangle in R d (this assumption is 
justified if, for example, the system is a motor control task), A be the finite ac- 
tion space (assuming continuous controls are discretized), x t +i — f(x t ,a t ) be the 
transition function (assuming that continuous time problems are discretized in 
time), and r(x, a) be the reward function. For the following theoretical argument 




Fig. 1. Bilinear interpolation to determine Q(f(£i,a),a') in R 2 . 



we require that both transition and reward function are Lipschitz continuous in 
the actions; i.e., there exist constants Lf,L r such that \\f(x,a) — f(x',a)\\ < 
Lf \\x — x'\\, and \r(x, a) — r(x' , a)\ < L r \\x — x'\\, Vx, x' € X, a S A.. In addi- 
tion, we assume that the reward is bounded, \r(x, a)\ < i?MAX, Vx, a. Note that 
in practice, while the first condition, continuity in the transition function, is usu- 
ally fulfilled for domains derived from physical systems, the second condition, 
continuity in the rewards, is often violated (e.g. in the mountain car domain, 
reward is in the goal and —1 everywhere else). Despite that we find that in 
many of these cases the outlined procedure may still work well enough. 

For any state x, we are interested in determining a sequence of actions 
ao, <xi, £l2) • • • such that the accumulated reward is maximized, 

oo 

V*(x):= max \ y^7*r(x t , a t ) \ x = x,x t +i = f{xt, a t ) \, 

a Q .ai,... L^— » J 

t=0 

where < 7 < 1. Using the Q-notation, where Q*(x, a) := r(x, a)+jV*(f(x, a)), 
the optimal decision policy tt* is found by first solving the Bellman equation in 
the unknown function Q, 

Q(x, a) = r(x, a) + 7max Q(f(x, a), a') \tx<^X, a^A (1) 

a' 

to yield Q* , and then choosing the action with the highest Q-value, 

7T* (x) = argmax Q*(x, a'). 

a' 

The Bellman operator T related to |T]) is defined by 

(TQ)(x,a) := r(x,a) + -/max Q{f(x,a),a'). (2) 



It is well known that T is a contraction and Q* the unique bounded solution to 
the fixed point problem Q(x, a) = (TQ){x, a), \/x, a. 



In order to solve the infinite dimensional problem in (|TJ) numerically, we 
have to reduce it to a finite dimensional problem. This is done by introducing 
a discretization r of X into a finite number of elements, applying the Bellman 
operator to only the nodes and interpolating in between. 

In the following we will consider a uniform grid J), with N vertices & and 
(i-dimensional tensor B-spline interpolation of order 1. The solution of ([1]) is 
then obtained in the space of piecewise affine functions. 

For a fixed action a', the value Q rh (z,a') of any state z with respect to 
grid Jft can be written as a convex combination of the vertices £j of the grid 
cell enclosing z with coefficients Wij (see Figure QJ,). For example, consider the 
2-dimensional case (bilinear interpolation) in Figure QJ>. Let z = (x,y) G M 2 . 
To determine Q rh (z,a'), we find the four vertices £00, £01 j £iOj £11 & R 2 of the 
enclosing cell with known function values g^o := Q rh (£.oo,a,'), ■ ■ . etc. We then 
perform two linear interpolations along the ^-coordinate (order invariant) in the 
auxilary points xq, x\ to obtain 

Q rh (x ,a') = (1 - \ a )q%' + \ q°[ 

Q rh {x u a!) = (1-AoKo + Ao5u 

where Ao := d x /h x (see Figure [lb for a definition of d x ,h x ,Xo,Xi), We then 
perform another linear interpolation in xq, X\ along the y-coordinate to obtain 

Q Fh (z,a') = (l-A 1 )(l-A )^; + (l-A 1 )A ^i + A 1 (l-A K + A 1 A ^; (3) 

where Ai := d y /h y . Weights u)y now correspond to the coefficients in Q. An 
analogous procedure applies to higher dimensions. 

Let Q a ' be the Nxl vector with entries [Q a ')i = Q Fh (S,i, a'). Let zf,...,z% € 
M. d denote the successor state we obtain when we apply the transition function 
/ to vertices £j using action a, i.e., zf :— f(£i,a). Let [wf]j — wfj denote the 
1 x N vector of coefficients for zf from (J3j> . The Q-value of zf for any action a' 
with respect to grid Fh can thus be written as Q Fh (zf,a') — J2f=ii w i]j[Q a ']j- 
Let W a with rows [wf] be the N x N matrix of all coefficients. (Note that this 
matrix is sparse: each row contains only 2 d nonzero entries). 

Let R a be the Nxl vector of associated rewards, [R a ]i ■= r(^, a). Now we 
can use ^fy to obtain a fixed point equation in the vertices of the grid J^, 

Q rh (^,a)^{T rh Q rh ){^,a) 1 = 1, . . . , N, a = 1, . . . , |^|, (4) 

where 

(T r *Q r *)(&,a) := r(&,a) +7 max Q A (/te, a), a'). 

Slightly abusing the notation, we can write this more compactly in terms of 
matrices and vectors, 

T Fh Q Fh :=R a + 7 max {\Y a Q a '\ Wa. (5) 

a' 

The Q-function is now represented by |^4| iV-dimensional vectors Q a , each con- 
taining the values for the vertices £j. The discretized Bellman operator T Fh is 
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Fig. 2. High-level overview of the GP-RMAX framework 



a contraction in R d x A and therefore has a unique fixed point Q* € R d x A. 
Let function Q* ,Fh : (R d x A) — > 1R be the Q-function obtained by linear in- 
terpolation of vector Q* along states. The function Q* ,Fh can now be used to 
determine (approximately) optimal control actions: for any state x S X, we 
simply determine 

T7*' Fh (x) — argmax Q*' Fh (x, a'). 

a' 

In order to estimate how well function Q* ,Fh approximates the true Q* , a poste- 
riori estimates can be defined that are based on local errors, i.e. the maximum 
of residual in each grid cell. The local error in a grid cell in turn depends on the 
granularity of the grid, h, and the modulus of continuity Lf, L g (e.g., see [91 14) 
for details). 



3 Our algorithm: GP-RMAX 

In the last section we have seen how, for a continuous state space, optimal 
behavior of an agent can be obtained in a numerically robust way, given that 
the transition function Xt+i = f(xt,at) is knownd 

For model-based RL we are now interested in solving the same problem for 
the case that the transition function is not known. Instead, the agent has to 
interact with the environment, and only use the samples it observes to compute 
optimal behavior. Our goal in this paper is to develop a learning framework 

2 Remember our working assumption: reward as a performance criterion is externally 
given and does not need to be estimated by the agent. Also note that discretization 
(even with more advanced methods like adaptive or sparse grids) is likely to be 
feasible only in state spaces with low to medium dimensionality. Breaking the curse 
of dimensionality is an open research problem. 



where this number is kept as small as possible. This will be done by using the 
samples to learn an estimate f(x,a) of f(x,a) and then use this estimate / in 
place of / in the numerical procedure outlined in the previous section. 



3.1 Overview 

A sketch of our architecture is shown in Figure [2j GP-RMAX consists of the 
two parts model learning and planning which are interwoven for online learning. 
The model-learner estimates the dynamics of the environment from the sample 
transitions the agent experiences while interacting with the environment. The 
planner is used to find the best possible action, given the current model. As the 
predictions of the model-learner become increasingly more accurate, the actions 
derived from the planner become increasingly closer to optimal. Below is a high- 
level overview of the algorithm: 

Input: 

• Reward function r(x, a) 

• Discount factor 7 

• Performance parameters: 

* planning and model-update frequency K 

* model accuracy Sf 1 , S^ 1 (stopping criterion for model-learning) 

* discretization of planner ./V 
Initialize: 

• Model Aii, Q-function Q\, observed transitions T>\ 

- Loop: t = 1,2,... 

• Interact with system: 

* observe current state xt 

* choose action a t greedy with respect to Q t 

a t = argmaxQ t (a; t ,a') 

a' 

* execute action at, observe next state Xt+i, store transition V t+ i — 
T> t U {x t ,a t ,x t+ i} 

• Model learning: (see Section l3T2"j) 

* only every K steps, and only if Ait is not sufficiently exact (as de- 
termined by evaluating the stopping criterion) 

• Ait+i — update.model (Ait, 2?t+i) 

• evaluate_stopping_criterion (Ait+i,Ait, 8^ , 5^) 

* else 

• Mt+i = Ai t 

• Planning with model: (see Section l3~3| 

* only every K steps, and only if Ait is not sufficiently exact (as de- 
termined by evaluating the stopping criterion) 

• Qt+i = augmented_value_iteration (Q u , Ait+i, @r(x, u), 7, N) 

* else 

• Qt+i = Qt 

Next, we will explain in more detail how each of the two functional modules 
"model-learner" and "planner" is implemented. 



3.2 Model learning with GPs 



In essence, estimating / from samples is a regression problem. While in the- 
ory any nonlinear regression algorithm could serve this purpose, we believe that 
GPs are particularly well-suited: (1) being non-parametric means great model- 
ing flexibility; (2) setting the hyperparameters can be done automatically (and 
in a principled way) via optimization of the marginal likelihood and allows au- 
tomatic determination of relevant inputs; and (3) GPs provide a natural way to 
determine the uncertainty of its predictions which will be used to guide explo- 
ration. Furthermore, uncertainty in GPs is supervised in that it depends on the 
target function that is estimated (because of (2)); other methods only consider 
the density of the data (unsupervised) and will tend to overexplore if the target 
function is simple. 

Assume we have observed a number of transitions, given as triplets of state, 
performed action, and resulting successor state, e.g., V = {xt, at, £t+i}{=i,2,... 
where Xt+i = f(xt,at). Note that / is a d-dimensional function, f{xt,at) = 
[fi(xt, a t ), ■ ■ ■ , fd(xt, o,t)\ T ■ Instead of trying to estimate / directly (which cor- 
responds to absolute transitions), we try to estimate the relative change Xt+\ — Xt 
as in [TU] . The effect of each action on each state variable will be treated indepen- 
dently: we train multiple univariate GPs and combine the individual predictions 
afterwards. Each individual G'Pij is trained in the respective subset of data in 
V, e.g., QVij is trained on all Xt as input, and x^h — x\^ as output, where 
at — j. Each individual QT\j has its own set of hyperparameters obtained from 
optimizing the marginal likelihood. 

The details of working with GPs can be found in [17] ; using GPs to learn 
a model for RL was previously also studied in [8] (for offline RL and without 
uncertainty-guided exploration). One characteristic of GPs is that their func- 
tional form is given in terms of a parameterized covariance function. Here we 
use the squared exponential, 

k(x, x'; v ,b, 9) = vo expj— 0.5(x ~ x') T Q(x — a;') j + b, 

where matrix fl is either one of the following: (1) Q = 01 (uniform), (2) 
fl = diag(#i, . . . , Od) (axis aligned ARD), (3) fl — M^Mj (factor analysis). 
Scalars vo,b and the (i?-dependent number of) entries of 9 constitute the hy- 
perparameters of the GP and are adapted from the training data (likelihood 

3 There is also the problem of implementing GPs efficiently when dealing with a pos- 
sible large number of data points. For the lack of space we can only sketch our 
particular implementation, see [16] for more detailed information. Our GP imple- 
mentation is based on the subset of regressors approximation. The elements of the 
subset are chosen by a stepwise greedy procedure aimed at minimizing the error in- 
curred from using a low rank approximation (incomplete Cholesky decomposition). 
Optimization of the likelihood is done on random subsets of the data of fixed size. 
To avoid a degenerate predictive variance, the projected process approximation was 
used. 



optimization). Note that variant (2) and (3) implement automatic relevance de- 
termination: relevant inputs or linear projections of inputs are automatically 
identified, whereby model complexity is reduced and generalization sped up. 

Once trained, for any testpoint x, QVij provides a distribution over target 
values, J\f(fj,ij(x),afj(x)), with mean fJ,ij(x) and variance crfj(x) (exact formulas 
for fi and a can be found in [17j). Each individual mean fj^j predicts the change 
in the i-th coordinate of the state under the j-th action. Each individual variance 
afj can be interpreted as the associated uncertainty; it will be close to if QVij 
is certain, and close to k(x, x) if it is uncertain (the value of k(x, x) depends on 
the hyperparameters of QVij). Stacking the individual predictions together, our 
model-learner produces in summary 



f(x, a) 











+ 


_Hda(x)_ 



c(x,a) := max (normalize^ (of a )) , (6) 

i— 1,...,<2 



where f(x,a) is the predicted successor state and c{x,a) the associated un- 
certainty (taken as maximum over the normalized per-coordinate uncertainties, 
where normalization ensures that the values lie between and 1). 



3.3 Planning with a model 

At any time t, the planner receives as input model Mf For any state x and 
action a, model Mt can be evaluated to "produce" the transition f(x,a) along 
with normalized scalar uncertainty c(x,a) G [0,1], where means maximally 
certain and 1 maximally uncertain (see Section 13.21) 

Let /ft, be the discretization of the state space X with nodes i = 1, . . . , N. 
We now solve the planning stage by plugging / into the procedure described in 
Section [21 First, we compute zf = f(£i,a), c(£j,a) from and the associated 
interpolation coefficients wfj from ([3]) for each node ^ and action a. 

Let C a denote the N x 1 vector corresponding to the uncertainties, [C a ]i = 
c(£i,a); and R a be the N x 1 vector corresponding to the rewards, [R a ]i — r(^, a). 
To solve the discretized Bellman equation in Eq. ((4]), we perform basic Jacobi 
iteration: 

- Initialize [Qq];, i = 1, . . . , N, a = 1, . . . , | A\ 

- Repeat for k = 0, 1, 2, . . . 

[Qg+ 1 ]i = [« B ]i+7nwx|^<[Qg'] i | V*,a (7) 

until <3fc +1 — Qt\oc < t°h or a maximum number of iterations is reached. 

To reduce the number of iterations necessary, we adapt Grime's increasing 
coordinate algorithm [9] to the case of Q-functions: instead of Eq. (JJJ, we perform 



updates of the form 




[Q a k+1 ]i = [1 - 7<]~ 1 [R% + 7 max V wUQi\ 3 }}. (0) 



13 1 

In [3] it was proved that Eq. ((TQ) converges to the same fixed point as Eq. ([7]) , and 
it was empirically demonstrated that convergence can occur in significantly fewer 
iterations. The exact reduction is problem-dependent, savings will be greater for 
small 7 and large cells where self-transitions occur (i.e., £j is among the vertices 
of the cell enclosing zf). 

To implement the "optimism in the face of uncertainty" principle, that is, to 
make the agent explore regions of the state space where the model predictions 
are uncertain, we employ the heuristic modification of the Bellman operator 
which was suggested in [T5] and shown to perform well. Instead of Eq. (0]) , the 
update rule becomes 

[Qt+ih = (1 - [C a h)[l - 7<r X I [R a ]i + 7 max J £ W %[Crf] 

V " I i 

+ [C a ]^MAX (0') 

where Vmax := -Rmax/(1 — 7). Eq. ([7T|) can be seen as a generalization of 
the binary uncertainty in the original RMAX paper to continuous uncertainty; 
whereas in RMAX a state was either "known" (sufficiently explored), in which 
case the unmodified update was used, or "unknown" (not sufficiently explored), 
in which case the value ^max was assigned, here the shift from exploration to 
exploitation is more gradual. 

Finally we can take advantage of the fact that the planning function will be 
called many times during the process of learning. Since the discretization is 
kept fixed, we can reuse the final Q- values obtained in one call to plan as initial 
values for the next call to plan. Since updates to the model often affect only 
states in some local neighborhood (in particular in later stages), the number of 
necessary iterations in each call to planning will be further reduced. 

A summary of our model-based planning function is shown below. 

Input: 

• Model M t , initial [Qo]i, i = 1, • • • , N, a = 1, . . . , \A\ 
Static inputs: 

• Grid ZTj with nodes £1, . . . , £/v, discount factor 7, reward function r(x, a) 
evaluated in nodes giving [R a ]i 

Initialize: 

• Compute zf = /(&,a) and [C% from M t (see Eq. ©) 

• Compute weights w° for each zf (see Eq. ©) 

- Loop: 

• Repeat update Eq. (|7I'[) until — Qk\oo < tol, Va, or the maximum 
number of iterations is reached. 



4 Experiments 



We now examine the online learning performance of GP-RMAX in various well- 
known RL benchmark domains. 

4.1 Description of domains 

In particular, we choose the following domains (where a large number of com- 
parative results is available in the literature): 

Mountain car: In mountain car, the goal is to drive an underpowered car from 
the bottom of a valley to the top of one hill. The car is not powerful enough to 
climb the hill directly, instead it has to build up the necessary momentum by 
reversing throttle and going up the hill on the opposite side first. The problem 
is 2-dimensional, state variable X\ G [—1.2,0.5] describes the position of the car, 
X2 G [—0.07,0.07] its velocity. Possible actions are a G { — 1,0, +1}. Learning is 
episodic: every step gives a reward of —1 until the top of the hill at x\ > 0.5 
is reached. Our experimental setup (dynamics and domain specific constants) is 
the same as in |19j . with the following exceptions: maximal episode length is 500 
steps, discount factor 7 = 0.99 and every episode starts with the agent being at 
the bottom of the valley with zero velocity, x s tart = ( — 7r/6, 0). 

Inverted pendulum: The next task is to swing up and stabilize a single-link 
inverted pendulum. As in mountain car, the motor does not provide enough 
torque to push the pendulum up in a single rotation. Instead, the pendulum 
needs to be swung back and forth to gather energy, before being pushed up and 
balanced. This creates a more difficult, nonlinear control problem. The state 
space is 2-dimensional, 9 G [— tt,tt] being the angle, 9 G [—10,10] the angular 
velocity. Control force is discretized to a G {—5, —2.5, 0, +2.5, +5} and held 
constant for 0.2sec. Reward is defined as r{x,a) := — O.lx 2 — O.Olx 2 , — 0.01a 2 . 
The remaining experimental setup (equations of motion and domain specific 
constants) is the same as in [6] . The task is made episodic by resetting the system 
every 500 steps to the initial state x s tart = (0, 0). Discount factor 7 = 0.99. 

Bicycle: Next we consider the problem of balancing a bicycle that rides at a 
constant speed [5],[I2]- The problem is 4-dimensional: state variables are the 
roll angle u> G [-127r/180, 127r/180], roll rate Co G [-27r,27r], angle of the handle 
bar a G [-807r/180, 8O71-/I8O], and the angular velocity a G [-27r,27r]. The ac- 
tion space is inherently 2-dimensional (displacement of rider from the vertical 
and turning the handlebar); in RL it is usually discretized into 5 actions. Our 
experimental setup so far is similar to [8]. To allow a more conclusive compar- 
ison of performance, instead of just being able to keep the bicycle from falling, 
we define a more discriminating reward r(x, a) = —x\, and r(x,a) = —10 for 
|xi| < 127r/180 (bicycle has fallen). Learning is episodic: every episode starts in 
one of two (symmetric) states close to the boundary from where recovery is im- 
possible: x s tart = (107r/180, 0, 0, 0) or Xstart = (— 107r/180, 0, 0, 0), and proceeds 
for 500 steps or until the bicycle has fallen. Discount factor 7 = 0.98. 



Acrobot: Our final problem is the acrobot swing-up task [T5]- The goal is to 
swing up the tip of the lower link of an underactuated two-link robot over a 
given height (length of first link). Since only the lower link is actuated, this is a 
rather challenging problem. The state space is 4-dimensional: 9\ 6 [— 7r, 7r], 9\ £ 
[— 4tt,Att], 62 £ [— ti", Ti"], #2 £ [— 97T, 97r]. Possible actions are a G {— 1, +1}. Our 
experimental setup and implementation of state transition dynamics is similar 
to [15]. The objective of learning is to reach a goal state as quickly as possible, 
thus r(x,a) = —1 for every step. The initial state for every episode is x sta rt = 
(0,0,0,0). An episode ends if either a goal state is reached or 500 steps have 
passed. The discount factor was set to 7 = 1, as in [19]. 

4.2 Results 

We now apply our algorithm GP-RMAX to each of the four problems. The 
granularity of the discretization Fh in the planner is chosen such that for the 2- 
dimensional problems, the loss in performance due to discretization is negligible. 
For the 4-dimensional problems, we ran offline trials with the true transition 
function to find the best compromise of granularity and computational efficiency. 
As result, we use a 100 x 100 grid for mountain car and inverted pendulum, a 
20 x 20 x 20 x 20 grid for the bicycle balancing task, and a 25 x 25 x 25 x 25 
grid for the acrobot. The maximum number of value iterations was set to 500, 
tolerance was < 10~ 2 . In practice, running the full planning step took between 
0.1-10 seconds for the small problems, and less than 5 min for the large problems 
(where often more than 50% of the CPU time was spent on computing the GP 
predictions in all the nodes of the grid). Using the planning module offline with 
the true transition function, we computed the best possible performance for each 
domain in advance. We obtained: mountain car (103 steps), inverted pendulum 
(-18.41 total cost), bicycle balancing (-3.49 total cost), and acrobot (64 steps)Q 

For the GP-based model-learner, we set the maximum size of the subset to 
1000, and ICD tolerance to 10~ 2 . The hyperparameters of the covariance were 
not manually tuned, but found from the data by likelihood optimiziation. 

Since it would be computationally too expensive to update the model and 
perform the full planning step after every single observation, we set the planning 
frequency K to 50 steps. To gauge if optimal behavior is reached and further 
learning becomes unnessecary, we monitor the change in the model predictions 
and uncertainties between successive updates and stop if both fall below a thresh- 
old (test points in a fixed coarse grid). 

We consider the following variations of the base algorithm: (1) GP-RMAXexp, 
which actively explores by adjusting the Bellman updates in Eq. (|71'[) according 
to the uncertainties produced by the GP prediction; (2) GP-RMAXgrid, which 
does the same but uses binary uncertainty by overlaying a uniform grid on top 
of the state-action space and keeping track which cells are visited; and (3) GP- 
RMAXnoexp, which does not actively explore (see Eq. ([7f])). For comparison, 

4 Note that 64 steps is not the optimal solution, 2 demonstrated swing- up with 61 
steps. 



we repeat the experiments using the standard online model-free RL algorithm 
Sarsa(A) with tile coding [T!5], where we consider two different setup of the tilings 
(one finer and one coarser). 

Figure [3] shows the result of online learning with GP-RMAX and Sarsa. In 
short, the graphs show us two things in particular: (1) GP-RMAX learns very 
quickly; and (2) GP-RMAX learns a behavior that is very close to optimal. In 
comparison, Sarsa(A) has a much higher sample complexity and does not always 
learn the optimal behavior (exception is the acrobot). While direct compari- 
son with other high performance RL algorithms, such as fitted value iteration 
[1818115] . policy iteration based LSPI/LSTD/LSPE [1214113111] . or other kernel- 
based methods [716] is difficult, because they are either batch methods or handle 
exploration in a more ad-hoc way, from the respective results given in the litera- 
ture it is clear that for the domains we examined GP-RMAX performs relatively 
well. 

Examining the plots in more detail, we find that, while GP-RMAXgrid 
is somewhat less sample efficient (explores more), GP-RMAXexp and GP- 
RMAXnoexp perform nearly the same. Initially, this appears to be in contrast 
with the whole point of RMAX, which is efficient exploration guided by the un- 
certainty of the predictions. Here, we believe that this behavior can be explained 
by the good generalization capabilities of GPs. Figure 0] illustrates model learn- 
ing and certainty propagation with GPs in the mountain car domain (predicting 
acceleration as function of state). The state of the model-learner is shown for 
two snapshots: after 40 transitions and after 120 transitions. The top row shows 
the value function that results from applying value iteration with the update 
modified for uncertainty, see Eq. (|7I'|) . The bottom row shows the observed sam- 
ples and the associated certainty of the predictions. As expected, certainty is 
high in regions where data was observed. However, due to the generalization 
of GPs and data-dependent hyperparameter selection, certainty is also high in 
unexplored regions; and in particular it is constant along the y-coordinate. To 
understand this, we have to look at the state transition function of the moun- 
tain car: acceleration of the car indeed only depends on the position, but not on 
velocity. This shows that certainty estimates of GPs are supervised and take the 
properties of the target function into account, whereas prior RMAX treatments 
of uncertainty are unsupervised and only consider the density of samples to de- 
cide if a state is "known" . For comparison, we also show what GP-RMAX with 
grid-based uncertainty would produce in the same situation. 

5 Summary 

We presented an implementation of model-based online reinforcement learning 
similar to RMAX for continuous domains by combining GP-based model learn- 
ing and value iteration on a grid. Doing so, our algorithm separates the problem 
function approximation in the model-learner from the problem function approx- 
imation/interpolation in the planner. If the transition function is easier to learn, 
i.e., requires only few samples relative to the representation of the optimal value 
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Fig. 3. Learning curves of our algorithm GP-RMAX (left column) and the standard 
method Sarsa(A) with tile coding (right column) in the four benchmark domains. Each 
curve shows the online learning performance and plots the total reward as a function of 
the episode (and thus sample complexity). The black horizontal line denotes the best 
possible performance computed offline. Note the different scale of the x-axis between 
GP-RMAX and Sarsa. 
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Fig. 4. Model-learning and propagation of "knownness" of state-action pairs with GPs. 
The top row shows the value function that results from applying value iteration with the 
update modified for uncertainty, see Eq. (|7I'|) . The bottom row shows the actual samples 
(red circles) and the induced uncertainty of all states: black is perfectly "known" , white 
is perfectly "unknown". Panels (a) and (b) show that with GPs certainty of model 
predictions is rapidly propagated through the whole state space, leading to strong 
generalization and targeted exploration. This in turn allows the optimal value function 
to be learned from very few sample transitions: panel (b) shows that after only 120 
transitions (still in the middle of the very first episode) the approximated value function 
already resembles the true one [15]. Panel (c) shows the same for a counter-based binary 
uncertainty; most of the grid cells are unvisited and the thus the approximate value 
function is zero in most parts of the state space. 



function, then large savings in sample-complexity can be gained. Related model- 
free methods, such as fitted Q-iteration, can not take advantage of this situation. 
The fundamental limitation of our approach is that it relies on solving the Bell- 
man equation globally over the state space. Even with more advanced discretiza- 
tion methods, such as adaptive grids, or sparse grids, the curse of dimensionality 
limits the applicability to problems with low or moderate dimensionality. Other, 
more minor limitations, concern the simplifying assumptions we made: deter- 
ministic state transitions and known reward function. However, these are not 
conceptual limitations but rather simplifying assumptions made for the present 
paper; they could be easily addressed in future work. 
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